Introduction

The health-care sector is among the most consequential parts of the US economy. It is an important part of people’s lives, supporting their health and well-being, and it employs roughly 11% of American workers, so a well-functioning health-care sector is a prerequisite for a well-functioning economy. Unfortunately, the US system has substantial problems: the United States spends more on health care than any other country, both as a share of GDP and per capita. Historical studies suggest that the US devotes around 10–14% of GDP to health care. This is one motivation for studying the problem, and in what follows we use statistical analysis to investigate what drives these costs.

To make a profit, insurance companies must charge premiums higher than the amounts they pay out to the insured. For this reason, insurance companies invest a lot of time, effort, and money in building models that can accurately predict health-care costs. To that end, we will first analyze the factors that influence medical costs and then build an adequate model and optimize its performance.

Using regression — both simple and multiple linear — we examine which variables affect medical costs and which do not. In the Medical Cost Personal dataset, age and bmi are continuous variables, while sex, smoker, and region are categorical. A quick statistical check shows there are no missing values in the dataset. Working through each variable will help us interpret the model later and see which variables affect medical costs; along the way we may run into issues with the categorical variables.

Medical Cost Personal Dataset

This dataset is taken from Kaggle; it was originally compiled by Brett Lantz for his book “Machine Learning with R” and was later made public.

Variables information:

1. age: the age of the insured (recipient).
2. sex: sex of the insured; “male” or “female”.
3. bmi: body mass index, giving a sense of whether weight is relatively high or low for one’s height.
4. children: number of children covered by health insurance / number of dependents.
5. smoker: whether the insured smokes.
6. region: the recipient’s residential area in the United States; northeast, southeast, southwest, northwest.
7. charges: individual medical costs billed by health insurance.

We can see that the variables sex, smoker, children, and region appear to be categorical.

Loading Data
insurance_data <- read.csv("/Users/harshpatel/Desktop/insurance.csv")
Summary of the data
str(insurance_data)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
summary(insurance_data)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Preliminary Data Analysis

sum(is.na(insurance_data))
## [1] 0

We found no missing values, so we can proceed with the original dataset.

Box plots for categorical variables
library(ggplot2)
library(gridExtra)

plot.sex <- ggplot(insurance_data, aes(x = sex, y = charges)) +
 geom_boxplot()

plot.smoker <- ggplot(insurance_data, aes(x = smoker, y = charges)) +
 geom_boxplot()

plot.child <- ggplot(insurance_data, aes(x = as.factor(children), y = charges)) +
 geom_boxplot()

plot.region <- ggplot(insurance_data, aes(x = region, y = charges)) +
 geom_boxplot()

grid.arrange(plot.sex, plot.smoker, plot.child, plot.region, ncol=2, nrow=2)

The first boxplot (upper left) shows that females and males pay roughly the same charges on average. The second boxplot (upper right) shows that smokers pay substantially higher charges than non-smokers. People with more children also tend to pay higher charges, while region seems to have no influence on charges. In all cases the charges have a skewed distribution.

Converting the categorical variables to be treated as factor
insurance_data$sex=as.factor(insurance_data$sex)
insurance_data$smoker=as.factor(insurance_data$smoker)
insurance_data$children=as.factor(insurance_data$children)
insurance_data$region=as.factor(insurance_data$region)
str(insurance_data)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: Factor w/ 6 levels "0","1","2","3",..: 1 2 4 1 1 1 2 4 3 1 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
source('/Users/harshpatel/Desktop/pairs.r') 
pairs(insurance_data[,c(1,3,4,7)],panel=panel.smooth,diag.panel=panel.hist,lower.panel=panel.cor)

We convert the intended categorical variables to factors because they were stored as non-numeric character strings and need to be encoded for the analysis and for the pair plot. The response variable itself is not transformed, but looking at the pair plot, it seems that treating some predictors as polynomials may yield more significance and a better-fitting model.

Preliminary model with all the variables

For the preliminary model, we fit a simple linear model using all the variables to predict the response, charges.

mod1<-lm(charges~age+sex+bmi+children+smoker+region,data=insurance_data)
summary(mod1)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11689.4  -2902.6   -943.7   1492.2  30042.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11927.17     993.66 -12.003  < 2e-16 ***
## age                257.19      11.91  21.587  < 2e-16 ***
## sexmale           -128.16     332.83  -0.385 0.700254    
## bmi                336.91      28.61  11.775  < 2e-16 ***
## children1          390.98     421.35   0.928 0.353619    
## children2         1635.78     466.67   3.505 0.000471 ***
## children3          964.34     548.10   1.759 0.078735 .  
## children4         2947.37    1239.16   2.379 0.017524 *  
## children5         1116.04    1456.02   0.767 0.443514    
## smokeryes        23836.41     414.14  57.557  < 2e-16 ***
## regionnorthwest   -380.04     476.56  -0.797 0.425318    
## regionsoutheast  -1033.14     479.14  -2.156 0.031245 *  
## regionsouthwest   -952.89     478.15  -1.993 0.046483 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared:  0.7519, Adjusted R-squared:  0.7497 
## F-statistic: 334.7 on 12 and 1325 DF,  p-value: < 2.2e-16

Looking at the model summary, several terms are not significant: sex, some levels of children, and regionnorthwest. It is surprising that, despite this, we obtain a decent R² value and a high F-statistic, so overall the model performs reasonably well. One possible explanation is collinearity among some of the predictors.

Checking the assumption of multicollinearity on the preliminary model.

library(car)
## Loading required package: carData
# VIF
vif(mod1)
##              GVIF Df GVIF^(1/(2*Df))
## age      1.020607  1        1.010251
## sex      1.009334  1        1.004656
## bmi      1.108836  1        1.053013
## children 1.024871  5        1.002460
## smoker   1.018024  1        1.008972
## region   1.109919  3        1.017533
# tolerance
1/vif(mod1)
##               GVIF        Df GVIF^(1/(2*Df))
## age      0.9798087 1.0000000       0.9898529
## sex      0.9907520 1.0000000       0.9953653
## bmi      0.9018469 1.0000000       0.9496562
## children 0.9757323 0.2000000       0.9975463
## smoker   0.9822954 1.0000000       0.9911081
## region   0.9009668 0.3333333       0.9827690
# mean VIF
mean(vif(mod1))
## [1] 1.354915

A VIF larger than 10 indicates multicollinearity. None of our VIFs exceed 10, so there is no evidence of multicollinearity. A tolerance below 0.2 would indicate problems; in our case all tolerances are above 0.2, so no problems there either. The mean VIF is also close to 1, so collinearity is not biasing the model.
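The rule of thumb above can be checked by hand: a predictor’s VIF equals 1/(1 − R²), where R² comes from regressing that predictor on all the others, and tolerance is its reciprocal. A minimal sketch on simulated, deliberately collinear data (not the insurance data):

```r
# Minimal sketch of the VIF formula on simulated, deliberately
# collinear predictors (not the insurance data).
set.seed(6)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)       # x2 is almost a copy of x1
r2 <- summary(lm(x2 ~ x1))$r.squared  # R^2 of the auxiliary regression
vif_x2 <- 1 / (1 - r2)                # VIF; its reciprocal is the tolerance
vif_x2                                # far above the 10 cutoff here
```

In our model none of the predictors comes anywhere near this situation, which is why all the GVIF values sit close to 1.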

Checking for the assumptions of residuals.

plot(mod1)

We plot four diagnostic graphs. The first shows the residuals against the fitted values. Ideally the points should scatter randomly around the horizontal zero line, which is clearly not the case here: three groups are visible, each with a different residual variance.

The variance of the residuals also increases with the fitted values (heteroscedasticity), and there may be a non-linear relationship between the outcome and the predictors. The deviation of the points from the reference line in the second plot (the Q-Q plot) further indicates that the residuals are not normally distributed.
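The visual impression of non-constant variance can be backed by a rough Breusch-Pagan-style check: regress the squared residuals on the fitted values and test the slope. A base-R sketch on simulated stand-in data (not insurance_data) whose noise grows with the predictor:

```r
# Rough Breusch-Pagan-style check on simulated stand-in data whose
# noise grows with age (not the real insurance_data).
set.seed(1)
n       <- 300
age     <- sample(18:64, n, replace = TRUE)
charges <- 250 * age + rnorm(n, sd = 40 * age)   # variance rises with age
fit <- lm(charges ~ age)
aux <- lm(residuals(fit)^2 ~ fitted(fit))        # auxiliary regression
p   <- summary(aux)$coefficients["fitted(fit)", "Pr(>|t|)"]
p   # a small p-value supports the heteroscedasticity diagnosis
```

With the real data, the same auxiliary regression on the residuals of mod1 would formalize what the first diagnostic plot suggests.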

Checking Polynomial relationship

library(ggplot2)
library(gridExtra)
plot.age <- ggplot(insurance_data, aes(x = age, y = charges)) +
 geom_point()

plot.age2 <- ggplot(insurance_data, aes(x = age^2, y = charges)) +
 geom_point()

grid.arrange(plot.age, plot.age2, ncol=2)

Looking at the first plot, we can see a slight U shape, while the second plot, which uses age^2, shows a roughly linear trend, so we can use age^2 in the model.
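The visual case for age² can also be made with a nested-model F test: fit charges on age and on poly(age, 2) and compare with anova(). A sketch on simulated data with a built-in quadratic effect (hypothetical numbers, not the insurance data):

```r
# Nested-model F test for a quadratic term, on simulated data with a
# genuine quadratic age effect (not the insurance data).
set.seed(2)
age     <- runif(400, 18, 64)
charges <- 3 * age^2 + rnorm(400, sd = 800)
lin  <- lm(charges ~ age)
quad <- lm(charges ~ poly(age, 2))
anova(lin, quad)  # a significant F means the squared term earns its keep
```

The same anova() comparison applied to the real data would confirm the choice of poly(age, 2) made below.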

library(ggplot2)
library(gridExtra)
plot.age <- ggplot(insurance_data, aes(x = bmi, y = charges)) +
 geom_point()

plot.age2 <- ggplot(insurance_data, aes(x = bmi^2, y = charges)) +
 geom_point()

grid.arrange(plot.age, plot.age2, ncol=2)

Plotting bmi does not suggest a polynomial relationship, as the points are widely spread, so we do not apply any polynomial transformation to bmi.

Model with polynomial
mod2 <- lm(charges ~ smoker + poly(age,2)  + bmi + children + region, data = insurance_data)
summary(mod2)
## 
## Call:
## lm(formula = charges ~ smoker + poly(age, 2) + bmi + children + 
##     region, data = insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12630.8  -2869.6   -908.2   1436.3  31028.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2065.86     915.35  -2.257  0.02418 *  
## smokeryes        23835.48     410.29  58.094  < 2e-16 ***
## poly(age, 2)1   131863.23    6080.33  21.687  < 2e-16 ***
## poly(age, 2)2    27045.69    6499.70   4.161 3.37e-05 ***
## bmi                331.94      28.42  11.681  < 2e-16 ***
## children1          933.81     438.63   2.129  0.03344 *  
## children2         2231.83     485.34   4.599 4.66e-06 ***
## children3         1480.80     558.61   2.651  0.00812 ** 
## children4         3501.60    1238.42   2.827  0.00476 ** 
## children5         1782.50    1455.56   1.225  0.22094    
## regionnorthwest   -404.00     473.54  -0.853  0.39372    
## regionsoutheast  -1042.73     476.07  -2.190  0.02868 *  
## regionsouthwest   -950.81     475.08  -2.001  0.04556 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6020 on 1325 degrees of freedom
## Multiple R-squared:  0.7551, Adjusted R-squared:  0.7529 
## F-statistic: 340.5 on 12 and 1325 DF,  p-value: < 2.2e-16

Checking interaction terms.

Quantitative predictors: bmi, age. Categorical predictors: sex, smoker, children, region.

library(car)
scatterplot(charges ~ bmi | smoker, data = insurance_data) # influence of smoker

scatterplot(charges ~ bmi | sex, data = insurance_data) # influence of sex

scatterplot(charges ~ bmi | region, data = insurance_data) # influence of region

Plotting the categorical variables against bmi shows an interaction effect in each case, so we will consider interactions of bmi with sex, smoker, and region.

scatterplot(charges ~ age | sex, data = insurance_data) # influence of sex

scatterplot(charges ~ age | smoker, data = insurance_data) # influence of smoker

scatterplot(charges ~ age | region, data = insurance_data) # influence of region

Plotting age against the categorical variables shows only additive effects, so it is better not to use any interactions with age.

Model with interaction to only bmi
mod3=lm(charges~bmi*sex,data=insurance_data)
summary(mod3)
## 
## Call:
## lm(formula = charges ~ bmi * sex, data = insurance_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23378  -8115  -3819   4773  48914 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3543.81    2362.31   1.500 0.133813    
## bmi           297.12      76.27   3.896 0.000103 ***
## sexmale     -4349.36    3328.11  -1.307 0.191487    
## bmi:sexmale   179.96     106.49   1.690 0.091273 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11860 on 1334 degrees of freedom
## Multiple R-squared:  0.0437, Adjusted R-squared:  0.04155 
## F-statistic: 20.32 on 3 and 1334 DF,  p-value: 7.029e-13

We can see that in this model sex is still not significant, and neither is the interaction, so it is better not to use this interaction.

mod4=lm(charges~bmi*smoker,data=insurance_data)
summary(mod4)
## 
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19768.0  -4400.7   -869.5   2957.7  31055.9 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## bmi               83.35      31.27   2.666  0.00778 ** 
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi:smokeryes   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7412 
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16

We can see that the model with only the interaction of bmi and smoker performs well, with an R² of around 74%, while the bmi:sex interaction does not, so we now fit a model that includes the most significant interaction.
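The additive-versus-interaction comparison can also be made formally with a nested F test. A sketch on simulated data where, by construction, bmi only matters for smokers (stand-in numbers, not insurance_data):

```r
# Nested F test: additive model vs. bmi * smoker interaction, on
# simulated data where bmi only matters for smokers.
set.seed(3)
n      <- 400
bmi    <- rnorm(n, 30, 6)
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE))
charges <- 5000 + 1400 * bmi * (smoker == "yes") + rnorm(n, sd = 3000)
additive <- lm(charges ~ bmi + smoker)
interact <- lm(charges ~ bmi * smoker)
anova(additive, interact)  # significant => keep the interaction term
```

Applied to the real data, anova(mod2, mod5)-style comparisons give a formal counterpart to eyeballing the R² values.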

Full Model with all significant interactions.
mod5=lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=insurance_data)
summary(mod5)
## 
## Call:
## lm(formula = charges ~ bmi + bmi:smoker + smoker + poly(age, 
##     2) + region + children, data = insurance_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14933.9  -1721.5  -1242.0   -707.9  31279.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7773.51     813.49   9.556  < 2e-16 ***
## bmi                 17.45      25.40   0.687 0.492332    
## smokeryes       -20173.04    1633.65 -12.348  < 2e-16 ***
## poly(age, 2)1   135664.02    4854.81  27.944  < 2e-16 ***
## poly(age, 2)2    25039.38    5188.06   4.826 1.55e-06 ***
## regionnorthwest   -617.73     378.02  -1.634 0.102471    
## regionsoutheast  -1212.13     380.01  -3.190 0.001457 ** 
## regionsouthwest  -1223.21     379.30  -3.225 0.001291 ** 
## children1          820.24     350.10   2.343 0.019284 *  
## children2         2080.05     387.40   5.369 9.32e-08 ***
## children3         1483.40     445.84   3.327 0.000901 ***
## children4         3917.29     988.53   3.963 7.80e-05 ***
## children5         2429.44    1161.95   2.091 0.036734 *  
## bmi:smokeryes     1434.08      52.15  27.497  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4805 on 1324 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.8426 
## F-statistic: 551.5 on 13 and 1324 DF,  p-value: < 2.2e-16

Comparing the interaction model with the regular model, we conclude that the interaction model performs well, with an R² of around 84% and a good fit to the values. The summary shows that almost all variables have significant p-values, which strengthens our claim, and the R² value is higher.

Outlier Analysis

library(car)
influencePlot(mod5)

##         StudRes         Hat        CookD
## 439  -0.0898323 0.064102618 3.951027e-05
## 517   5.0746727 0.008701754 1.585062e-02
## 1013  4.0151674 0.047387259 5.663598e-02
## 1048 -3.2039227 0.052248865 4.014117e-02
## 1086  0.9727164 0.074225481 5.418884e-03
## 1301  6.6439458 0.008535099 2.628631e-02

The influence plot flags six observations as potentially influential: 439, 517, 1013, 1048, 1086, and 1301. This does not mean we should immediately delete them; rather, we should investigate why they are influential, which we do next using different influence measures. At first glance, observations such as 1013 and 1048 appear to be outliers, as they lie far from the bulk of the data.

par(mfrow=c(2,2))
plot(mod5,1)
plot(mod5,3)
plot(mod5,5)
plot(mod5,6)

Having identified the influential points, we must decide how to handle them. Since we have a large dataset, dropping those rows is a reasonable option, for several reasons: outliers can distort the mean of the data and depress correlations, and removing influential points can improve the fit of the regression equation and its predicted values. So we remove them and check whether the model’s performance changes.
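Instead of reading the points off the influence plot, they can be flagged programmatically with Cook’s distance and the common 4/n cutoff. A sketch on simulated data with one planted outlier (not the insurance data):

```r
# Flagging influential rows with Cook's distance and the 4/n rule of
# thumb, then refitting without them (simulated data, one planted outlier).
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[1] <- 40                         # plant a gross outlier
d    <- data.frame(x, y)
fit  <- lm(y ~ x, data = d)
drop <- which(cooks.distance(fit) > 4 / n)
refit <- lm(y ~ x, data = d[-drop, ])
drop  # row 1 is among the flagged observations
```

Note the `d[-drop, ]` row indexing with the trailing comma: without it, R would index columns rather than rows.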

new_data <- insurance_data[-c(439, 517, 1013, 1048, 1086, 1301), ]
model_woi<-lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=new_data)
summary(model_woi)
## 
## Call:
## lm(formula = charges ~ bmi + bmi:smoker + smoker + poly(age, 
##     2) + region + children, data = new_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14933.9  -1721.5  -1242.0   -707.9  31279.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7773.51     813.49   9.556  < 2e-16 ***
## bmi                 17.45      25.40   0.687 0.492332    
## smokeryes       -20173.04    1633.65 -12.348  < 2e-16 ***
## poly(age, 2)1   135664.02    4854.81  27.944  < 2e-16 ***
## poly(age, 2)2    25039.38    5188.06   4.826 1.55e-06 ***
## regionnorthwest   -617.73     378.02  -1.634 0.102471    
## regionsoutheast  -1212.13     380.01  -3.190 0.001457 ** 
## regionsouthwest  -1223.21     379.30  -3.225 0.001291 ** 
## children1          820.24     350.10   2.343 0.019284 *  
## children2         2080.05     387.40   5.369 9.32e-08 ***
## children3         1483.40     445.84   3.327 0.000901 ***
## children4         3917.29     988.53   3.963 7.80e-05 ***
## children5         2429.44    1161.95   2.091 0.036734 *  
## bmi:smokeryes     1434.08      52.15  27.497  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4805 on 1324 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.8426 
## F-statistic: 551.5 on 13 and 1324 DF,  p-value: < 2.2e-16

We can see that deleting the influential points has essentially no effect on the model we fitted earlier.

Performing Cross Validation

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x dplyr::recode()  masks car::recode()
## x purrr::some()    masks car::some()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(123)

random_sample <- createDataPartition(insurance_data $ charges, p = 0.7, list = FALSE) #Dividing the data in 70:30 ratio 

training_dataset  <- insurance_data[random_sample, ]

testing_dataset <- insurance_data[-random_sample, ]

mod6 <- lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=training_dataset)  # cross-validation model

predictions<-predict(mod6,testing_dataset)

data.frame( R2 = R2(predictions, testing_dataset $ charges),
            RMSE = RMSE(predictions, testing_dataset $ charges),
            MAE = MAE(predictions, testing_dataset $ charges))
##          R2     RMSE      MAE
## 1 0.8260944 4969.245 2913.679

The R² on the held-out test set is about 82%, slightly lower than for the model fitted to all the data, but it is a more honest estimate of how the model will perform on new data.
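A single 70/30 split can be noisy; k-fold cross-validation averages the test error over several splits. A base-R sketch of 5-fold CV on simulated data (with the real data, the same loop applies to mod6’s formula and insurance_data):

```r
# 5-fold cross-validated RMSE in base R (simulated data; swap in
# insurance_data and the mod6 formula for the real analysis).
set.seed(123)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)                 # true noise sd is 1
folds <- sample(rep(1:5, length.out = n)) # random fold labels
rmse <- sapply(1:5, function(k) {
  fit  <- lm(y ~ x, data = d[folds != k, ])
  pred <- predict(fit, newdata = d[folds == k, ])
  sqrt(mean((d$y[folds == k] - pred)^2))
})
mean(rmse)  # should sit near the noise sd of 1
```

caret’s trainControl(method = "cv") wraps the same idea, but the loop makes the mechanics explicit.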

Model Selection

We will try to find the best possible subset of predictors using ols_step_best_subset, a function that fits every possible model over the candidate variables and reports the corresponding metrics, from which we can determine which variables are significant and which are not. It also lets us compare the different models without hand-coding forward and backward selection procedures. We chose this approach because running the step function raised an error (“function is masked by some different library”), so we went a different route.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_step_best_subset(mod6)
##                      Best Subsets Regression                     
## -----------------------------------------------------------------
## Model Index    Predictors
## -----------------------------------------------------------------
##      1         bmi:smoker                                         
##      2         poly(age, 2) bmi:smoker                            
##      3         smoker poly(age, 2) bmi:smoker                     
##      4         smoker poly(age, 2) children bmi:smoker            
##      5         smoker poly(age, 2) region children bmi:smoker     
##      6         bmi smoker poly(age, 2) region children bmi:smoker 
## -----------------------------------------------------------------
## 
##                                                                   Subsets Regression Summary                                                                  
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                                        
## Model    R-Square    R-Square    R-Square      C(p)         AIC            SBIC           SBC              MSEP               FPE            HSP         APC  
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.7396      0.7390      0.7378    685.6017    19060.7565    -177919.3362    19080.1315    36483081884.2513    39019161.3318    41642.9324    0.2615 
##   2        0.8286      0.8278      0.8267    134.1045    18672.5141    -433475.9918    18701.5766    24040738858.3882    25766843.8666    27499.6752    0.1725 
##   3        0.8456      0.8448      0.8433     30.3199    18576.4929    -537576.9943    18610.3991    21678498109.0562    23259714.7342    24824.1665    0.1557 
##   4        0.8504      0.8488      0.8466      2.5702    18556.9750    -572678.2605    18615.1000    21029461104.5978    22684780.0981    24210.8372    0.1512 
##   5        0.8514      0.8493      0.8467     -2.0000    18556.3288    -580039.6158    18628.9851    20903412197.1657    22621710.8844    24143.8548    0.1505 
##   6        0.8514      0.8493      0.8467      0.0000    18556.3288    -578939.8869    18628.9851    20925888984.4741    22645674.5612    24169.8159    0.1508 
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

From this we conclude that the full models 5 and 6 work best compared to the others; following the KISS principle, we can simply go with model 5, as it has one fewer variable and the same accuracy as model 6, which makes it less complex.

Now we’ll run different metrics on the models we created earlier to see which works best for us.

source('/Users/harshpatel/Downloads/modelselfunctions.r')
model1<-Criteria(mod1,FullMSE = 10,label=T)
model2<-Criteria(mod2,FullMSE = 10,label=T)
model3<-Criteria(mod3,FullMSE = 10,label=T)
model4<-Criteria(mod4,FullMSE = 10,label=T)
model5<-Criteria(mod5,FullMSE = 10,label=T)
model6<-Criteria(mod6,FullMSE = 10,label=T)

df<-rbind(model1,model2,model3,model4,model5,model6)
df
##        p+1  R2adj          Cp      AIC        PRESS
## model1  13 0.7497  4863906994 23318.92  49673726637
## model2  13 0.7529  4801704879 23301.70  49040471884
## model3   4 0.0416 18750487809 25106.39 188739016540
## model4   4 0.7412  5063202800 23354.65  50976180503
## model5  14 0.8426  3056364877 22699.27  31255560631
## model6  14 0.8493  2076959713 15892.40  21436878626

Across these metrics, some models fit similarly, but model 6 — the one fitted on the cross-validation training set — comes out best on every criterion. (Note that model 6 is fitted on only 70% of the data, which partly explains its lower AIC and PRESS.)
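One of the metrics above, PRESS, can be computed without refitting the model n times, using the leave-one-out identity e(i) = e_i / (1 − h_ii), where h_ii are the hat values. A sketch on simulated data:

```r
# PRESS via the leave-one-out identity e_(i) = e_i / (1 - h_ii),
# computed from a single fit (simulated data).
set.seed(5)
x <- rnorm(50)
y <- x + rnorm(50)
fit   <- lm(y ~ x)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press
```

The same one-liner applied to mod1 through mod6 reproduces the PRESS column in the table above.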

Conclusion

By combining interactions, polynomial terms, multiple regression, outlier analysis, and cross-validation, we conclude that the model incorporating all of these performs well on the data, and the cross-validated model also performs well on the test data, which makes its performance estimate more trustworthy. The best R² we obtained during the analysis was around 84%. To further improve accuracy and prediction, we could try machine-learning algorithms such as KNN, random forests, or SVMs, which may yield better predictive accuracy; in the end, though, regression — the simplest method of all — works quite well.

Literature Review

As mentioned in the introduction, health-care costs in the US are higher than anywhere else in the world, and there is a large literature studying this. Here we focus on a piece by David W. Bates in Health Affairs titled “Big Data In Healthcare”.

The US medical care system is rapidly adopting electronic health records, which will dramatically increase the amount of clinical data available electronically. At the same time, rapid progress has been made in clinical analytics — techniques for analyzing large quantities of data and gleaning new insights from that analysis — which is part of what is known as big data.

A health insurance company can only make money if it collects more than it spends on the medical care of its beneficiaries. On the other hand, even though some conditions are more prevalent in certain segments of the population, medical costs are difficult to predict, since much of the money is spent on patients’ rare conditions.

So the objective is to accurately predict insurance costs from people’s data, including age, body mass index, smoking status, and so on. Additionally, we will determine the most important variable influencing insurance costs. These estimates could be used to build actuarial tables that set yearly premiums higher or lower according to expected treatment costs, which is naturally framed as a regression problem.